Wiktionary as a source for automatic pronunciation extraction
Authors
Abstract
In this paper, we analyze whether web dictionaries that contain phonetic notations can support the rapid creation of pronunciation dictionaries when building speech recognition and speech synthesis systems. As a representative dictionary, we selected Wiktionary [1], since it is available in multiple languages and, in addition to word definitions, provides many phonetic notations in the International Phonetic Alphabet (IPA). Given word lists in four languages (English, French, German, and Spanish), we calculated the percentage of words with phonetic notations in Wiktionary. Furthermore, we performed two quality checks. First, we compared pronunciations from Wiktionary to pronunciations from dictionaries based on the GlobalPhone project, which had been created in a rule-based fashion and manually cross-checked [2]. Second, we analyzed the impact of Wiktionary pronunciations on automatic speech recognition (ASR) systems. French Wiktionary achieved the best pronunciation coverage, containing phonetic notations for 92.58% of the French GlobalPhone word list, as well as for 76.12% and 30.16% of country and international city names. In our ASR evaluation, the Spanish system improved the most from Wiktionary pronunciations, with a 7.22% relative word error rate reduction.
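The two figures reported in the abstract, pronunciation coverage and relative word error rate reduction, can be illustrated with a minimal Python sketch. The function names `coverage` and `relative_wer_reduction`, the dictionary layout, and the toy data are assumptions made for illustration; they are not taken from the paper, and the paper's actual extraction and scoring pipeline is not shown here.

```python
def coverage(word_list, pronunciations):
    """Percentage of words in word_list that have at least one pronunciation.

    `pronunciations` is assumed to map a word to a list of IPA strings.
    """
    words = list(word_list)
    if not words:
        return 0.0
    found = sum(1 for w in words if pronunciations.get(w))
    return 100.0 * found / len(words)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy data, invented for illustration only (not results from the paper).
toy_words = ["chat", "maison", "paris", "xyzzy"]
toy_prons = {"chat": ["ʃa"], "maison": ["mɛ.zɔ̃"], "paris": ["pa.ʁi"]}

print(coverage(toy_words, toy_prons))      # 75.0
print(relative_wer_reduction(20.0, 18.0))  # 10.0
```

Under this reading, a relative reduction compares the change in word error rate to the baseline error rate, so it is larger than the absolute difference between the two rates.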
Similar resources
Automatic Error Recovery for Pronunciation Dictionaries
In this paper, we present our latest investigations on pronunciation modeling and its impact on ASR. We propose completely automatic methods to detect, remove, and substitute inconsistent or flawed entries in pronunciation dictionaries. The experiments were conducted on different tasks, namely (1) word-pronunciation pairs from the Czech, English, French, German, Polish, and Spanish Wiktionary [...
Transformation of Wiktionary entry structure into tables and relations in a relational database schema
This paper addresses the question of automatic data extraction from the Wiktionary, which is a multilingual and multifunctional dictionary. Wiktionary is a collaborative project working on the same principles as the Wikipedia. The Wiktionary entry is a plain text from the text processing point of view. Wiktionary guidelines prescribe the entry layout and rules, which should be followed by edito...
Extracting Lexical-Semantic Knowledge from the Portuguese Wiktionary
Public domain collaborative resources like Wiktionary and Wikipedia have recently become attractive sources for information extraction. To use these resources in natural language processing (NLP) tasks, efficient programmatic access to their contents is required. In this work, we have extracted semantic relations automatically from the Portuguese Wiktionary and compared our results with the re...
Automatic Idiom Identification in Wiktionary
Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed ...
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of stan...
Publication date: 2010